Exploring the GLOTREC catalogue¶
Gimena del Rio & Romina De León¶
(HDLAB CONICET)¶
Notebook designed and maintained by Romina De León¶
Goals:¶
- Download data from the GLOTREC repository
- Standardize and export the dataset for exploration and analysis
- Clean and prepare GLOTREC data related to Argentine Textbooks
- Explore the data and the relationships between:
  - Authors and Publishers
  - Publishers and School Subjects
  - Publishers, Authors, and School Subjects
- Build visualizations similar to those currently available in GLOTREC, improved with a focus on specific periods.
Libraries to use¶
Description:
- pandas: for data cleaning, manipulation, and tabular representation
- numpy: for efficient numerical and array operations
- matplotlib and seaborn: for statistical and exploratory data visualization
- plotly: for the interactive Sankey and parallel-categories plots
- re: for applying regular expressions in text normalization
- openpyxl: for reading and exporting Excel files
- networkx: for analyzing and visualizing relationships through network graphs
- unidecode: for removing diacritics and standardizing text encodings
- math: provides the mathematical functions defined by the C standard
Installation of packages if not already installed¶
In [167]:
# Only needed once to install packages
%pip install -q pandas numpy matplotlib seaborn unidecode openpyxl squarify networkx pyvis plotly
Note: you may need to restart the kernel to use updated packages.
Import necessary libraries¶
In [168]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re, openpyxl, networkx as nx
from unidecode import unidecode
import plotly.express as px
from pyvis.network import Network
import itertools
import math
import plotly.graph_objects as go
Setup visualization aesthetics for plots¶
In [169]:
sns.set_theme(style="whitegrid", palette="mako")
plt.rcParams.update({
"figure.facecolor": "white",
"axes.titlesize": 12,
"axes.labelsize": 10,
"xtick.labelsize": 8,
"ytick.labelsize": 8
})
Read the downloaded Excel file and display first rows¶
In [170]:
df = pd.read_excel(
"data/itbc_export_2025.xlsx",
usecols=lambda c: not c.startswith("Unnamed"),
dtype={"Year": "float"}
)
print(df.info())
display(df.sample(5))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 335 entries, 0 to 334
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  335 non-null    object
 1   Call Number         332 non-null    object
 2   GLOTREC|Cat Link    335 non-null    object
 3   Catalogue           335 non-null    object
 4   Library Catalogue   335 non-null    object
 5   Year                335 non-null    float64
 6   Publisher           335 non-null    object
 7   Place               335 non-null    object
 8   Title               335 non-null    object
 9   Authors             321 non-null    object
 10  Pages               333 non-null    object
 11  Format              335 non-null    object
 12  School Subject      335 non-null    object
 13  Level of Education  335 non-null    object
 14  Document Type       335 non-null    object
 15  Country of Use      335 non-null    object
dtypes: float64(1), object(15)
memory usage: 42.0+ KB
None
| | ID | Call Number | GLOTREC\|Cat Link | Catalogue | Library Catalogue | Year | Publisher | Place | Title | Authors | Pages | Format | School Subject | Level of Education | Document Type | Country of Use |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 169 | 654887500 | RA G-5(1,68) | gei654887500 | GEI | PPN=654887500 | 1968.0 | Kapelusz | Buenos Aires | Geografía dinámica - general, Asia y Africa A.... | Perpillou, Aime \| Pernet, L. \| Rampa, Alfredo C. | [VI], 328 S. Ill., graph. Darst., Kt. | Book | Geography | ISCED 2 - Lower secondary level | Textbook | Argentina |
| 264 | 655829725 | RA S-18(14,72)1 | gei655829725 | GEI | PPN=655829725 | 1972.0 | Losada | Buenos Aires | Educación democrática 1er año, [Schülerbd.] Jo... | Delfino, Jorge Raúl | 143 S. | Book | Social studies/politics | ISCED 2 - Lower secondary level | Textbook | Argentina |
| 153 | 654844038 | RA RA-45(7,56) | gei654844038 | GEI | PPN=654844038 | 1956.0 | Lasserre | Buenos Aires | Colegial primer libro de lectura corriente Áng... | Raggi, Ángela E. | [X], 104 S. Ill., Kt. | Book | Mother tongue - Readers | ISCED 1 - Primary level | Reading Book | Argentina |
| 13 | 1027054358 | RA S-43(3,2005)8 | gei1027054358 | GEI | PPN=1027054358 | 2005.0 | A-Z editora | Ciudad Autónoma de Buenos Aires | Sociedad en red EGB 3o ciclo 8, [Schülerband] ... | Bruno, Paula \| Calvo, Graciela \| Castellani, A... | 285 Seiten Illustrationen, Diagramme, Karten | Book | Social studies/politics \| History \| Geography | ISCED 2 - Lower secondary level | Textbook | Argentina |
| 34 | 1027341152 | RA S-49(1,2013)2 | gei1027341152 | GEI | PPN=1027341152 | 2013.0 | A-Z editora | Ciudad Autónoma de Buenos Aires | Educación cívica 2, [Schülerband] Poder, estad... | Fraga, Norberto E. \| Ribas, Gabriel A. | 157 Seiten Illustrationen | Book | Social studies/politics | ISCED 2 - Lower secondary level | Textbook | Argentina |
Normalization of the Publisher column¶
- Clean up spaces and convert to lowercase
- Normalize publisher names with a mapping dictionary
- Apply the normalization to the Publisher column
In [171]:
df['Publisher'] = df['Publisher'].str.strip().str.lower()
mapa_editoriales = {
'a-z editora': 'A-Z Editora',
'a-z ed.': 'A-Z Editora',
'az editora': 'A-Z Editora',
'estrada': 'Estrada',
'estrada secundaria': 'Estrada',
'angel estrada & cía.s.a.-editores': 'Estrada',
'puerto de palos s.a. casa de édiciones': 'Puerto de Palos',
'puerto de palos': 'Puerto de Palos',
'aique primaria': 'Aique',
'aique secundaria': 'Aique',
'aique': 'Aique',
'kapelusz': 'Kapelusz',
'ed. kapelusz': 'Kapelusz',
'kapelusz norma': 'Kapelusz',
'tinta fresca': 'Tinta Fresca',
'doce orcas ediciones': 'Doce Orcas',
'doce orcas ed.': 'Doce Orcas',
'doce orcas': 'Doce Orcas',
'ed. stella': 'Stella',
'ed. atlántida': 'Atlántida',
'losada': 'Losada',
'ed. troquel': 'Troquel',
'imprenta mercur': 'Mercur',
'imprenta de pablo e. coni, especial para obras': 'Coni',
'coni': 'Coni',
'igon': 'Igon',
'igón': 'Igon',
'goethe-inst.': 'Goethe-Institut',
'cesarini': 'Cesarini',
'cesarini hnos. ed.': 'Cesarini',
'producciones mawis': 'Mawis',
'editorial h.m.e.': 'HME',
'imprenta y librería de mayo': 'Librería de Mayo',
'librería del colegio, alsina y bolívar': 'Librería del Colegio',
'cabaut, librería del colegio': 'Librería del Colegio',
'alsina & bolívar, librería del colegio': 'Librería del Colegio',
'librería del colegio': 'Librería del Colegio',
'ed. crespillo': 'Crespillo',
'f. crespillo': 'Crespillo',
'f. crespillo editor': 'Crespillo',
'ed. peuser': 'Peuser',
'peuser': 'Peuser'
}
df['Publisher'] = df['Publisher'].replace(mapa_editoriales).str.title()
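To see what the three steps do to individual values, here is a minimal pure-Python sketch that mirrors the strip/lower, mapping, and title-case chain on single strings. The helper name `normalize_publisher`, the three-entry `sample_map`, and the input "editorial ejemplo" are illustrative assumptions, not part of the notebook. The `dict.get` fallback imitates how a dictionary-based `Series.replace` leaves unmapped values unchanged.

```python
# Illustrative subset of the mapping used above (keys are already lowercase).
sample_map = {
    "ed. kapelusz": "Kapelusz",
    "editorial h.m.e.": "HME",
    "imprenta y librería de mayo": "Librería de Mayo",
}

def normalize_publisher(raw, mapping=sample_map):
    """Mirror the chain: strip -> lower -> dict lookup -> title case."""
    key = raw.strip().lower()
    mapped = mapping.get(key, key)   # unmapped names pass through unchanged
    return mapped.title()            # title case is applied to every value

print(normalize_publisher("  Ed. Kapelusz "))
print(normalize_publisher("editorial ejemplo"))   # hypothetical unmapped name
print(normalize_publisher("Editorial H.M.E."))
```

Because title case is applied after the mapping, canonical names that the dictionary spells with capitals or lowercase particles (for example 'HME' or 'Librería de Mayo') come out as 'Hme' and 'Librería De Mayo'. If the dictionary spellings should be preserved, one option would be to title-case only the values that the mapping did not touch.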
Create a function to normalize author names according to specified rules¶
1. Remove accents and extra spaces
2. If there's a comma, we assume "Last, First" format
3. If last name has multiple parts, keep them together
4. Select only the first given name
5. Rebuild normalized name
6. If no comma, just title case the whole name
Apply function to Authors column
In [172]:
def normalizar_autor(nombre):
    if not isinstance(nombre, str) or not nombre.strip():
        return None
    # remove accents and extra spaces
    nombre = unidecode(nombre.strip())
    # If there's a comma, we assume "Last, First" format
    if ',' in nombre:
        apellido, resto = nombre.split(',', 1)
        apellido = apellido.strip()
        # if last name has multiple parts, keep them together
        apellido = re.sub(r'\s+', ' ', apellido)
        # select only the first given name
        resto = resto.strip()
        primer_nombre = resto.split()[0] if resto else ''
        # rebuild normalized name
        nombre_norm = f"{apellido.title()}, {primer_nombre.title()}"
    else:
        # if no comma, just title case the whole name
        nombre_norm = nombre.title()
    return nombre_norm.strip()
# Apply the function to the Authors column: split on '|' and normalize each name
df['Authors'] = df['Authors'].fillna('').str.split('|').apply(lambda lst: [normalizar_autor(a) for a in lst if a])
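As a worked example of the rules listed above, the following sketch runs them on two author strings taken from the sample rows. It is a stand-alone approximation: `strip_accents` uses the standard-library `unicodedata` module as a stand-in for `unidecode` (an assumption made so the snippet has no third-party dependency), and the English function names are illustrative.

```python
import re
import unicodedata

def strip_accents(text):
    """Drop combining marks after NFKD decomposition (stand-in for unidecode)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize_author(name):
    """Apply the rules listed above to one author string."""
    if not isinstance(name, str) or not name.strip():
        return None
    name = strip_accents(name.strip())
    if "," in name:
        last, rest = name.split(",", 1)
        last = re.sub(r"\s+", " ", last.strip())   # keep multi-part surnames together
        rest = rest.strip()
        first = rest.split()[0] if rest else ""    # only the first given name
        return f"{last.title()}, {first.title()}".strip()
    return name.title()

raw = "Perpillou, Aime | Pernet, L. | Rampa, Alfredo C."
authors = [normalize_author(a) for a in raw.split("|") if a]
print(authors)
print(normalize_author("Raggi, Ángela E."))
```

Keeping only the first given name merges spelling variants such as "Rampa, Alfredo C." and "Rampa, Alfredo", but it would also merge two different authors who share a surname and first name, so the resulting author counts are best read as approximate.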
Graph Section¶
Graph Publishers by Number of Books¶
In [173]:
top = df['Publisher'].value_counts().head(25)
plt.figure(figsize=(14, 8))
sns.barplot(y=top.index, x=top.values, hue=top.index, palette="Set3", legend=False)
plt.xlabel('Books Count')
plt.ylabel('Publisher')
plt.title('Publishers by Number of Books')
plt.tight_layout()
plt.show()
Graph Publishers by Number of Books and Level of Education¶
In [174]:
sns.set(style="whitegrid", rc={
"axes.facecolor": "white",
"figure.facecolor": "white",
"axes.edgecolor": "lightgray",
"grid.color": "lightgray"
})
# temporary dataframe with counts
df_temp = (
df.groupby(['Level of Education', 'Document Type', 'Publisher'])
.size()
.reset_index(name='Books Count')
)
df_temp = df_temp[df_temp['Books Count'] > 3] #filter publishers with more than 3 books
# create facet grid
g = sns.FacetGrid(
df_temp,
col='Level of Education',
col_wrap=2,
sharex=False,
sharey=False,
height=4.5,
aspect=1.3,
margin_titles=True,
despine=False,
)
# facet_grid barplot
g.map_dataframe(
sns.barplot,
x='Books Count',
y='Publisher',
hue='Publisher',
palette='Set3',
legend=False,
dodge=False,
edgecolor='gray',
errorbar=None
)
g.set_titles(col_template="{col_name}", fontsize=11, fontweight='bold', pad=10)
# per-axis styling and adjustments
for ax in g.axes.flat:
    if ax.get_legend():
        ax.legend_.remove()
    ax.tick_params(axis='y', labelsize=9)
    ax.tick_params(axis='x', labelsize=8)
    ax.grid(True, axis='x', linestyle=':', linewidth=0.5)
# add a single shared legend at the bottom (only if the axes expose legend handles)
handles, labels = g.axes.flat[0].get_legend_handles_labels()
if handles:
    g.fig.legend(handles, labels, loc='lower center', ncol=4, fontsize=9, frameon=False)
# global labels
g.set_axis_labels('Books Count', 'Publisher')
plt.subplots_adjust(hspace=0.4, wspace=0.4, bottom=0.25)
plt.show()
Heatmap of Publishers vs School Subjects¶
In [175]:
plt.figure(figsize=(18, 12))
(df
.explode('Publisher')
.query("`School Subject` != 'German taught in non-German-speaking countries'")
.pipe(lambda d: sns.heatmap(
pd.crosstab(d['School Subject'], d['Publisher']),
#cmap='PiYG',
cmap='YlGnBu',
annot=False,
fmt='d',
annot_kws={"size": 10},
linewidth=0.5,
linecolor='white',
cbar_kws={'label': 'Books Count'},
mask=(pd.crosstab(d['School Subject'], d['Publisher']) <= 3)
)))
plt.title('Count of Books by Publisher and Subject', fontsize=18, pad=20, weight='bold')
plt.xlabel('Publisher', fontsize=14, labelpad=12)
plt.ylabel('School Subject', fontsize=14, labelpad=12)
plt.xticks(rotation=60, ha='right', fontsize=12)
plt.yticks(rotation=0, fontsize=12)
sns.despine(left=True, bottom=True)
plt.tight_layout()
plt.show()
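The heatmap above combines a contingency table with a mask that hides cells holding three books or fewer. A minimal pure-Python sketch of that logic follows; the `books` list is hypothetical data, and `Counter` plays the role that `pd.crosstab` plays in the notebook.

```python
from collections import Counter

# Hypothetical (subject, publisher) pairs, one per book.
books = (
    [("Geography", "Kapelusz")] * 5
    + [("History", "Kapelusz")] * 2
    + [("History", "Estrada")] * 4
)

counts = Counter(books)                      # pair -> number of books
subjects = sorted({s for s, _ in counts})
publishers = sorted({p for _, p in counts})

# Dense table: rows = subjects, columns = publishers, missing pairs count as 0.
table = [[counts.get((s, p), 0) for p in publishers] for s in subjects]
# Mask mirrors `crosstab <= 3`: True means the cell is hidden in the heatmap.
mask = [[value <= 3 for value in row] for row in table]

for subject, row, hidden in zip(subjects, table, mask):
    print(subject, row, hidden)
```

Since masked cells are simply left blank, a publisher with many subjects but at most three books in each would disappear from the plot entirely, which is worth keeping in mind when reading empty columns.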
In [176]:
# Histplot of books by Year and Publisher
# df_exploded is only created in a later cell; define it here so the notebook runs top to bottom
df_exploded = df.explode('Authors')
top_publishers = df_exploded['Publisher'].value_counts().head(25).index
plt.figure(figsize=(14, 8))
colors_palette = sns.color_palette("Paired", 12) + sns.color_palette("Dark2", 13)
#sns.set(style="ticks", palette="colors_palette")
ax = sns.histplot(
    df_exploded[df_exploded['Publisher'].isin(top_publishers)],
    x='Year',
    hue='Publisher',
    multiple='stack',
    palette=colors_palette,
    bins=40,
    legend=True
)
# Force legend customization
leg = ax.get_legend()
if leg:
    leg.set_title("Publisher")
    leg._loc = 2  # 2 corresponds to 'upper left'
    leg.set_bbox_to_anchor((1.05, 1))
    leg.set_frame_on(False)
    for t in leg.texts:
        t.set_fontsize(9)
else:
    print("⚠️ Legend not found.")
# no year filter is applied, so the title covers the whole period
plt.title('Distribution of Books by Year and Publisher')
plt.xlabel('Year of Publication')
plt.ylabel('Book Count')
sns.despine()
plt.tight_layout(rect=[0, 0, 0.85, 1])
plt.show()
In [178]:
# Heatmap of Publishers vs Level of Education
df['Level of Education'] = df['Level of Education'].fillna('').str.split('|').str[0]
plt.figure(figsize=(20, 12))
(df
.explode('Publisher')
#.query("`School Subject` != 'German taught in non-German-speaking countries'")
.pipe(lambda d: pd.crosstab(d['Level of Education'], d['Publisher']))
.pipe(lambda ctab: ctab[ctab >= 4].dropna(how='all').dropna(axis=1, how='all'))
.pipe(lambda filtered: sns.heatmap(
filtered,
#cmap='PiYG',
cmap='YlGnBu',
annot=False,
fmt='d',
annot_kws={"size": 8},
linewidth=0.5,
linecolor='white',
cbar_kws={'label': 'Books Count'})
))
plt.title('Count of Books by Publisher and Level of Education', fontsize=18, pad=20, weight='bold')
plt.xlabel('Publisher', fontsize=13, labelpad=12)
plt.ylabel('Level of Education', fontsize=13, labelpad=12)
plt.xticks(rotation=60, ha='right', fontsize=12)
plt.yticks(rotation=0, fontsize=12)
sns.despine(left=True, bottom=True)
plt.tight_layout()
plt.show()
In [179]:
# Heatmap of Publishers vs Document Type
df['Document Type'] = df['Document Type'].fillna('').str.split('|').str[0]
plt.figure(figsize=(16, 10))
(df
.explode('Publisher')
#.query("`School Subject` != 'German taught in non-German-speaking countries'")
.pipe(lambda d: sns.heatmap(
pd.crosstab(d['Document Type'], d['Publisher']),
#cmap='PiYG',
cmap='YlGnBu',
annot=False,
fmt='d',
linecolor='gray',
cbar_kws={'label': 'Books Count'},
mask=(pd.crosstab(d['Document Type'], d['Publisher']) < 4)
)))
plt.title('Count of Books by Publisher and Document Type', fontsize=18, pad=20, weight='bold')
plt.xlabel('Publisher', fontsize=13, labelpad=10)
plt.ylabel('Document Type', fontsize=13, labelpad=10)
plt.xticks(rotation=60, ha='right', fontsize=9)
plt.yticks(rotation=0, fontsize=10)
sns.despine(left=True, bottom=True)
plt.tight_layout()
plt.show()
Sankey plot for Publisher-Author collaborations¶
Overview of collaborations between publishers and authors in the catalogue
In [180]:
# Count summary
df_flow = (df.explode('Authors')
.groupby(['Publisher', 'Authors'])
.size()
.reset_index(name='count'))
# Optional filters: leave the lists empty to keep all publishers/authors
include_publishers = []
include_authors = []
if include_publishers:
    df_flow = df_flow[df_flow['Publisher'].isin(include_publishers)]
if include_authors:
    df_flow = df_flow[df_flow['Authors'].isin(include_authors)]
df_flow = df_flow[df_flow['count'] >= 2] # keep publisher-author pairs with at least 2 publications
# create nodes and links
publishers = df_flow['Publisher'].dropna().unique().tolist()
authors = df_flow['Authors'].dropna().unique().tolist()
all_nodes = publishers + authors
source = df_flow['Publisher'].apply(lambda x: all_nodes.index(x))
target = df_flow['Authors'].apply(lambda x: all_nodes.index(x))
value = df_flow['count']
# colors for each group
publisher_color = "#4C72B0"
author_color = "#9909A9"
colors = [publisher_color] * len(publishers) + [author_color] * len(authors)
# create Sankey
fig = go.Figure(go.Sankey(
node=dict(
label=all_nodes,
pad=15,
thickness=15,
color=colors,
line=dict(color="white", width=0.5)
),
link=dict(
source=source,
target=target,
value=value,
color="rgba(150,150,150,0.3)"
)
))
fig.update_layout(
title_text=f"Flow of Publications between Publishers and Authors<br><sup>{len(publishers)} publishers — {len(authors)} authors</sup>",
font_size=10,
height=700
)
fig.show(renderer='notebook')
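A Sankey trace identifies the two ends of each link by integer positions in the node label list, which is why the cell above converts names to indices. The sketch below shows the same conversion on hypothetical flows, using dictionaries instead of repeated `list.index` calls; the variable names `publisher_index` and `author_index` are illustrative.

```python
# Hypothetical publisher-author flows: (publisher, author, number of books).
flows = [
    ("Kapelusz", "Rampa, Alfredo", 3),
    ("Kapelusz", "Perpillou, Aime", 2),
    ("Losada", "Delfino, Jorge", 2),
]

publishers = list(dict.fromkeys(p for p, _, _ in flows))  # unique, order kept
authors = list(dict.fromkeys(a for _, a, _ in flows))
all_nodes = publishers + authors

# Separate lookups per group, so a name that appears both as a publisher
# and as an author cannot be resolved to the wrong node.
publisher_index = {name: i for i, name in enumerate(publishers)}
author_index = {name: len(publishers) + i for i, name in enumerate(authors)}

source = [publisher_index[p] for p, _, _ in flows]
target = [author_index[a] for _, a, _ in flows]
value = [n for _, _, n in flows]

print(all_nodes)
print(source, target, value)
```

With `list.index`, a string present in both groups always resolves to its first position (the publisher); keeping one lookup per group avoids that edge case and is also faster on long node lists.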
Alluvial plots of Publishers vs Authors by period¶
In [181]:
fig = px.parallel_categories(
df.explode('Authors')
.query('(Year >= 1860) & (Year < 1900)')
.groupby(['Publisher', 'Authors'])
.size()
.reset_index(name='count'),
dimensions=['Publisher', 'Authors'],
color='count',
color_continuous_scale=px.colors.sequential.Viridis,
labels={'Publisher': 'Publisher', 'Authors': 'Author', 'count': 'Books Count'},
)
fig.update_layout(
title=dict(
text="Publisher–Author Collaborations (1860–1900)",
x=0.5,
xanchor='center',
font=dict(size=18, family='Arial Black')
),
font=dict(size=11, family='Arial'),
coloraxis_colorbar=dict(
title="Books Count",
tickfont=dict(size=10)
),
paper_bgcolor='white',
plot_bgcolor='white',
margin=dict(l=60, r=60, t=80, b=50),
height=700,
dragmode=False,
coloraxis_showscale=False
)
fig.show(renderer='notebook')
In [182]:
fig = px.parallel_categories(
df.explode('Authors')
.query('(Year >= 1900) & (Year < 1940)')
.groupby(['Publisher', 'Authors'])
.size()
.reset_index(name='count'),
dimensions=['Publisher', 'Authors'],
color='count',
color_continuous_scale=px.colors.sequential.Cividis,
labels={'Publisher': 'Publisher', 'Authors': 'Author', 'count': 'Books Count'},
)
fig.update_layout(
title=dict(
text="Publisher–Author Collaborations (1900–1940)",
x=0.5,
xanchor='center',
font=dict(size=18, family='Arial Black')
),
font=dict(size=11, family='Arial'),
coloraxis_colorbar=dict(
title="Books Count",
tickfont=dict(size=10)
),
paper_bgcolor='white',
plot_bgcolor='white',
margin=dict(l=60, r=60, t=80, b=50),
height=700,
dragmode=False,
coloraxis_showscale=False
)
fig.show(renderer='notebook')
In [183]:
fig = px.parallel_categories(
df.explode('Authors')
.query('(Year >= 1940) & (Year < 1980)')
.groupby(['Publisher', 'Authors'])
.size()
.reset_index(name='count')
.query('count >= 2'),
dimensions=['Publisher', 'Authors'],
color='count',
color_continuous_scale=px.colors.sequential.Plasma,
labels={'Publisher': 'Publisher', 'Authors': 'Author', 'count': 'Books Count'},
)
fig.update_layout(
title=dict(
text="Publisher–Author Collaborations (1940–1980)",
x=0.5,
xanchor='center',
font=dict(size=18, family='Arial Black')
),
font=dict(size=11, family='Arial'),
coloraxis_colorbar=dict(
title="Books Count",
tickfont=dict(size=10)
),
paper_bgcolor='white',
plot_bgcolor='white',
margin=dict(l=60, r=60, t=80, b=50),
height=700,
dragmode=False,
coloraxis_showscale=False
)
fig.show(renderer='notebook')
In [184]:
fig = px.parallel_categories(
df.explode('Authors')
.query('(Year >= 1980) & (Year < 2000)')
.groupby(['Publisher', 'Authors'])
.size()
.reset_index(name='count')
.query('count >= 2'),
dimensions=['Publisher', 'Authors'],
color='count',
color_continuous_scale=px.colors.sequential.Plasma,
labels={'Publisher': 'Publisher', 'Authors': 'Author', 'count': 'Books Count'},
)
fig.update_layout(
title=dict(
text="Publisher–Author Collaborations (1980–2000)",
x=0.5,
xanchor='center',
font=dict(size=18, family='Arial Black')
),
font=dict(size=11, family='Arial'),
coloraxis_colorbar=dict(
title="Books Count",
tickfont=dict(size=10)
),
paper_bgcolor='white',
plot_bgcolor='white',
margin=dict(l=60, r=60, t=80, b=50),
height=700,
dragmode=False,
coloraxis_showscale=False
)
fig.show(renderer='notebook')
In [185]:
fig = px.parallel_categories(
df.explode('Authors')
.query('(Year >= 2000) & (Year < 2010)')
.groupby(['Publisher', 'Authors'])
.size()
.reset_index(name='count')
.query('count >= 2'),
dimensions=['Publisher', 'Authors'],
color='count',
color_continuous_scale=px.colors.sequential.Plasma,
labels={'Publisher': 'Publisher', 'Authors': 'Author', 'count': 'Books Count'},
)
fig.update_layout(
title=dict(
text="Publisher–Author Collaborations (2000–2010)",
x=0.5,
xanchor='center',
font=dict(size=18, family='Arial Black')
),
font=dict(size=11, family='Arial'),
coloraxis_colorbar=dict(
title="Books Count",
tickfont=dict(size=10)
),
paper_bgcolor='white',
plot_bgcolor='white',
margin=dict(l=60, r=60, t=80, b=50),
height=700,
dragmode=False,
coloraxis_showscale=False
)
fig.show(renderer='notebook')
In [186]:
fig = px.parallel_categories(
df.explode('Authors')
.query('(Year >= 2010)')
.groupby(['Publisher', 'Authors'])
.size()
.reset_index(name='count')
.query('count >= 2'),
dimensions=['Publisher', 'Authors'],
color='count',
color_continuous_scale=px.colors.sequential.Plasma,
labels={'Publisher': 'Publisher', 'Authors': 'Author', 'count': 'Books Count'},
)
fig.update_layout(
title=dict(
text="Publisher–Author Collaborations after 2010",
x=0.5,
xanchor='center',
font=dict(size=18, family='Arial Black')
),
font=dict(size=11, family='Arial'),
coloraxis_colorbar=dict(
title="Books Count",
tickfont=dict(size=10)
),
paper_bgcolor='white',
plot_bgcolor='white',
margin=dict(l=60, r=60, t=80, b=50),
height=700,
dragmode=False,
coloraxis_showscale=False
)
fig.show(renderer='notebook')
Heatmap of relationship between School Subject and Authors¶
In [187]:
plt.figure(figsize=(16, 10))
""" vmax = 25 """
(df
.explode('Authors')
.query("`School Subject` != 'German taught in non-German-speaking countries'")
.pipe(lambda d: d[
d['School Subject'].isin(d['School Subject'].value_counts().nlargest(25).index) &
d['Authors'].isin(d['Authors'].value_counts().nlargest(25).index)
])
.pipe(lambda d: sns.heatmap(
pd.crosstab(d['School Subject'], d['Authors']),
cmap='cubehelix_r',
annot=False,
fmt='d',
linecolor='gray',
cbar_kws={'label': 'Books Count'},
#vmax=vmax
))
)
plt.title('Relationship between School Subjects and Authors', fontsize=18, pad=20, weight='bold')
plt.xlabel('Authors', fontsize=13, labelpad=10)
plt.ylabel('School Subject', fontsize=13, labelpad=10)
#
plt.xticks(rotation=45, ha='right', fontsize=9)
plt.yticks(rotation=10, fontsize=10)
#
sns.despine(left=True, bottom=True)
#
plt.tight_layout()
plt.show()
New column for decades¶
In [188]:
# column with decades of publication
df['YearInterval'] = pd.cut(df['Year'], bins=list(range(1860, 2021, 10)))
print(df['YearInterval'].value_counts().sort_index())
YearInterval
(1860, 1870]     1
(1870, 1880]     1
(1880, 1890]     1
(1890, 1900]     6
(1900, 1910]     4
(1910, 1920]     6
(1920, 1930]     4
(1930, 1940]     4
(1940, 1950]    10
(1950, 1960]    34
(1960, 1970]    63
(1970, 1980]    23
(1980, 1990]    43
(1990, 2000]    14
(2000, 2010]    49
(2010, 2020]    72
Name: count, dtype: int64
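With its default settings `pd.cut` builds right-closed bins, so each interval excludes its lower edge and includes its upper edge. The following sketch reproduces that rule in plain Python to make the boundary cases explicit; `decade_interval` is a hypothetical helper, not a pandas function.

```python
import math

def decade_interval(year, start=1860, stop=2020, width=10):
    """Return the right-closed bin (low, high] containing `year`, or None."""
    if not (start < year <= stop):
        return None                      # outside every bin, like NaN from pd.cut
    k = math.ceil((year - start) / width)
    return (start + width * (k - 1), start + width * k)

for y in (1860, 1870, 1871, 1900, 1968.0, 2020, 2021):
    print(y, decade_interval(y))
```

Two consequences follow. A round year such as 1900 is grouped with the preceding decade, (1890, 1900], whereas the period plots in this notebook filter with `Year >= 1900` and therefore place it in the following period. A year of exactly 1860 would not be assigned to any bin; the counts printed above add up to 335, the full number of rows, so no record is lost this way in the current export.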
Time series of book counts by authors¶
In [189]:
df_exploded = df.explode("Authors")
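# Count books by YearInterval, Publisher, School Subject, and Authors
# (observed=False is passed explicitly: it keeps the current pandas behaviour and avoids the FutureWarning)
counts = (
    df_exploded.groupby(["YearInterval", "Publisher", "School Subject", "Authors"], observed=False)
    .size()
    .reset_index(name="Count")
)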
# transform YearInterval to its midpoint for plotting
counts['YearMid'] = counts['YearInterval'].apply(lambda x: x.mid if pd.notnull(x) else None)
# top authors for plotting
top_authors = (
counts.groupby("Authors")["Count"].sum().nlargest(25).index
)
counts_top = counts[counts["Authors"].isin(top_authors)]
plt.figure(figsize=(16, 10))
sns.lineplot(
data=counts_top, # filter to top authors
x="YearMid",
y="Count",
hue="Authors",
marker="o",
linewidth=2,
palette="tab20",
)
# Labels and style
plt.title("Books Count over Time by Top Authors", pad=15, weight="bold")
plt.xlabel("Year (midpoint of interval)")
plt.ylabel("Number of Books")
plt.legend(title="Authors", bbox_to_anchor=(1.05, 1), loc="upper left", frameon=False)
sns.despine()
plt.tight_layout()
plt.show()
Network Graph¶
Network graph of publishers and decades¶
In [190]:
# Build edgelist for decades and publishers
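# (observed=False is passed explicitly: it keeps the current pandas behaviour and avoids the FutureWarning)
edgelist = (
    df_exploded.groupby(['YearInterval', 'Publisher'], observed=False)
    .size()
    .reset_index(name='weight')
)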
# build graph
G = nx.from_pandas_edgelist(
edgelist,
source="YearInterval",
target="Publisher",
edge_attr="weight"
)
# Graph visualization
plt.figure(figsize=(12,12))
pos = nx.spring_layout(G, k=1.0, iterations=50, seed=42)
# Edge weights
weights = [G[u][v]['weight']*0.5 for u,v in G.edges()]
# colors for decades vs publishers
node_colors = []
for node in G.nodes():
    if isinstance(node, pd.Interval):  # decade nodes are pd.Interval objects, not numbers
        node_colors.append('lightgreen') # Decade
    elif node in df_exploded['Publisher'].unique():
        node_colors.append('lightblue') # Publisher
    else:
        node_colors.append('lightcoral') # Author
# proportional node sizes
node_sizes = [100 + 50*G.degree(n) for n in G.nodes()]
nx.draw(
G, pos, with_labels=True,
node_color=node_colors,
node_size=node_sizes,
edge_color='gray',
width=weights,
font_size=9
)
plt.title("Relationship between Publishers and Decades", fontsize=14)
plt.show()
Tripartite graph: Decade ↔ Publisher ↔ Author¶
In [191]:
# prepare data for network graph
df_exploded = (
df.explode('Authors')
.dropna(subset=['Authors'])
.query("Authors != ''")
)
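# build edgelist for Decade ↔ Publisher
# (observed=False is passed explicitly: it keeps the current pandas behaviour and avoids the FutureWarning)
edges_decade_publisher = (
    df_exploded.groupby(['YearInterval', 'Publisher'], observed=False)
    .size()
    .reset_index(name='weight')
)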
# build edgelist for Publisher ↔ Author
edges_publisher_author = (
df_exploded.groupby(['Publisher', 'Authors'])
.size()
.reset_index(name='weight')
)
common_publishers = set(edges_decade_publisher['Publisher']).intersection(edges_publisher_author['Publisher'])
edges_decade_publisher = edges_decade_publisher[
edges_decade_publisher['Publisher'].isin(common_publishers)
]
edges_publisher_author = edges_publisher_author[
edges_publisher_author['Publisher'].isin(common_publishers)
]
# tripartite graph
G = nx.Graph()
# add Decade ↔ Publisher edges
for _, row in edges_decade_publisher.iterrows():
    G.add_edge(row['YearInterval'], row['Publisher'], weight=row['weight'])
# add Publisher ↔ Author edges
for _, row in edges_publisher_author.iterrows():
    G.add_edge(row['Publisher'], row['Authors'], weight=row['weight'])
# delete isolated nodes
G.remove_nodes_from(list(nx.isolates(G)))
# graph visualization
plt.figure(figsize=(20,20))
pos = nx.spring_layout(G, k=2.5, iterations=250, seed=42)
# weights for edges
pesos = [G[u][v]['weight']*0.1 for u,v in G.edges()]
# colors for node types
node_colors = []
for node in G.nodes():
    if isinstance(node, pd.Interval):  # decade nodes are pd.Interval objects, not numbers
        node_colors.append('lightcoral') # Decade
    elif node in df_exploded['Publisher'].unique():
        node_colors.append('lightblue') # Publisher
    else:
        node_colors.append('lightgreen') # Author
# node size proportional to degree
node_sizes = [100 + 50*G.degree(n) for n in G.nodes()]
nx.draw(
G, pos, with_labels=True,
node_color=node_colors,
node_size=node_sizes,
edge_color='gray',
width=pesos
)
plt.title("Tripartite graph: Decade ↔ Publisher ↔ Author")
plt.show()
In [192]:
print(len(G.nodes()), "nodes and", len(G.edges()), "edges")
391 nodes and 884 edges
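To make the graph construction and the degree-based node sizing above easier to inspect, here is a small pure-Python sketch of the same idea. The edge list is hypothetical, and `build_adjacency` and `node_sizes` are illustrative helpers rather than networkx functions. It also shows one thing to watch for: if a grouped edge list contains zero-count combinations (which can happen when a categorical key such as the decade interval is grouped with `observed=False`), those pairs still become edges and raise the degree of both endpoints.

```python
from collections import defaultdict

# Hypothetical weighted edges: (decade label, publisher, number of books).
edges = [
    ("(1950, 1960]", "Kapelusz", 4),
    ("(1960, 1970]", "Kapelusz", 7),
    ("(1960, 1970]", "Losada", 0),   # zero-count pair
    ("(1970, 1980]", "Losada", 2),
]

def build_adjacency(edge_list, keep_zero=True):
    """Undirected adjacency dict {node: {neighbour: weight}}."""
    adj = defaultdict(dict)
    for u, v, w in edge_list:
        if w == 0 and not keep_zero:
            continue                  # optionally skip pairs with no books
        adj[u][v] = w
        adj[v][u] = w
    return dict(adj)

def node_sizes(adj):
    """Same sizing rule as above: 100 + 50 * degree."""
    return {node: 100 + 50 * len(neigh) for node, neigh in adj.items()}

sizes_all = node_sizes(build_adjacency(edges))
sizes_nonzero = node_sizes(build_adjacency(edges, keep_zero=False))
print(sizes_all)
print(sizes_nonzero)
```

Comparing the two results on the real edge lists (for instance by filtering `weight > 0` before building the graph) is a quick way to check whether the reported node and edge counts include such empty pairs.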
Network graph of Authors and Publishers¶
In [193]:
# prepare data for network graph
df_exploded = (
df.explode('Authors')
.dropna(subset=['Authors'])
.query("Authors != ''")
)
df_network = df_exploded[['Publisher', 'Authors']].dropna().copy()
# build graph bipartite Publisher ↔ Author
G = nx.from_pandas_edgelist(
df_network,
source='Publisher',
target='Authors',
create_using=nx.Graph()
)
# graph visualization focusing on top nodes
node_degrees = dict(G.degree())
degrees_sorted = sorted(node_degrees.values(), reverse=True)
# degree of the 26th-ranked node; fall back to the lowest degree in smaller graphs
threshold = degrees_sorted[min(25, len(degrees_sorted) - 1)]
# keep nodes whose degree reaches the threshold
nodes_to_draw = [node for node, degree in node_degrees.items() if degree >= threshold]
subgraph = G.subgraph(nodes_to_draw)
plt.figure(figsize=(15, 10))
pos = nx.spring_layout(subgraph, k=0.8, iterations=20) # layout algorithm
# split nodes by type for coloring
publishers_nodes = [n for n in subgraph.nodes() if n in df_network['Publisher'].unique()]
authors_nodes = [n for n in subgraph.nodes() if n in df_network['Authors'].explode().unique()]
# draw nodes
nx.draw_networkx_nodes(
subgraph, pos,
nodelist=publishers_nodes,
node_color='skyblue',
node_size=800,
label='Publishers'
)
nx.draw_networkx_nodes(
subgraph, pos,
nodelist=authors_nodes,
node_color='lightcoral',
node_size=200,
label='Authors'
)
# draw edges and labels
nx.draw_networkx_edges(subgraph, pos, alpha=0.5)
nx.draw_networkx_labels(subgraph, pos, font_size=12)
plt.title('Network of Authors and Publishers (Top 25 nodes)', fontsize=16, weight='bold')
plt.legend()
plt.tight_layout()
plt.axis('off')
plt.show()
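The "top nodes" selection above takes the degree found at position 25 of the descending ranking and keeps every node at or above it. A stand-alone sketch of that rule, with a guard for graphs that have fewer nodes than the requested rank, could look like this; `nodes_above_rank` and the sample degrees are hypothetical.

```python
def nodes_above_rank(degrees, rank=25):
    """Keep nodes whose degree is >= the degree at 0-based position `rank`
    of the descending ranking; small graphs fall back to the lowest degree."""
    if not degrees:
        return []
    ranked = sorted(degrees.values(), reverse=True)
    threshold = ranked[min(rank, len(ranked) - 1)]
    return [node for node, deg in degrees.items() if deg >= threshold]

degrees = {"Kapelusz": 5, "Estrada": 3, "Losada": 3, "Rampa, Alfredo": 1}
print(nodes_above_rank(degrees, rank=1))    # ties at the threshold are all kept
print(nodes_above_rank(degrees, rank=25))   # fewer than 26 nodes: everything is kept
```

Because position 25 is the 26th value and ties are kept, the drawn subgraph can contain more than 25 nodes; in a bipartite publisher-author graph where most authors have degree 1, the threshold can easily drop to 1 and select the whole graph.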
Network graph by time lapse¶
In [194]:
# prepare data for network graph
df_exploded = df.explode('Authors')
df_network = df_exploded[
df_exploded['Year'].between(1860, 1940)
][['Year', 'Publisher', 'Authors']].dropna(subset=['Authors']).copy()
# build graph bipartite Publisher ↔ Author
G = nx.from_pandas_edgelist(
df_network,
source='Publisher',
target='Authors',
create_using=nx.Graph()
)
# graph visualization focusing on top nodes
node_degrees = dict(G.degree())
degrees_sorted = sorted(node_degrees.values(), reverse=True)
# degree of the 26th-ranked node; fall back to the lowest degree in smaller graphs
threshold = degrees_sorted[min(25, len(degrees_sorted) - 1)]
# keep nodes whose degree reaches the threshold
nodes_to_draw = [node for node, degree in node_degrees.items() if degree >= threshold]
subgraph = G.subgraph(nodes_to_draw)
plt.figure(figsize=(15, 10))
pos = nx.spring_layout(subgraph, k=0.8, iterations=20) # layout algorithm
# split nodes by type for coloring
publishers_nodes = [n for n in subgraph.nodes() if n in df_network['Publisher'].unique()]
authors_nodes = [n for n in subgraph.nodes() if n in df_network['Authors'].explode().unique()]
# draw nodes
nx.draw_networkx_nodes(
subgraph, pos,
nodelist=publishers_nodes,
node_color='skyblue',
node_size=800,
label='Publishers'
)
nx.draw_networkx_nodes(
subgraph, pos,
nodelist=authors_nodes,
node_color='lightcoral',
node_size=200,
label='Authors'
)
# draw edges and labels
nx.draw_networkx_edges(subgraph, pos, alpha=0.5)
nx.draw_networkx_labels(subgraph, pos, font_size=12)
plt.title('Network of Authors and Publishers (Top 25 nodes) between 1860-1940', fontsize=16, weight='bold')
plt.legend()
plt.tight_layout()
plt.axis('off')
plt.show()
In [195]:
# prepare data for network graph
df_exploded = df.explode('Authors')
df_network = df_exploded[
df_exploded['Year'].between(1940, 1960)
][['Year', 'Publisher', 'Authors']].dropna(subset=['Authors']).copy()
# build graph bipartite Publisher ↔ Author
G = nx.from_pandas_edgelist(
df_network,
source='Publisher',
target='Authors',
create_using=nx.Graph()
)
# graph visualization focusing on top nodes
node_degrees = dict(G.degree())
degrees_sorted = sorted(node_degrees.values(), reverse=True)
# degree of the 26th-ranked node; fall back to the lowest degree in smaller graphs
threshold = degrees_sorted[min(25, len(degrees_sorted) - 1)]
# keep nodes whose degree reaches the threshold
nodes_to_draw = [node for node, degree in node_degrees.items() if degree >= threshold]
subgraph = G.subgraph(nodes_to_draw)
plt.figure(figsize=(15, 10))
pos = nx.spring_layout(subgraph, k=0.8, iterations=20) # layout algorithm
# split nodes by type for coloring
publishers_nodes = [n for n in subgraph.nodes() if n in df_network['Publisher'].unique()]
authors_nodes = [n for n in subgraph.nodes() if n in df_network['Authors'].explode().unique()]
# draw nodes
nx.draw_networkx_nodes(
subgraph, pos,
nodelist=publishers_nodes,
node_color='skyblue',
node_size=800,
label='Publishers'
)
nx.draw_networkx_nodes(
subgraph, pos,
nodelist=authors_nodes,
node_color='lightcoral',
node_size=200,
label='Authors'
)
# draw edges and labels
nx.draw_networkx_edges(subgraph, pos, alpha=0.5)
nx.draw_networkx_labels(subgraph, pos, font_size=12)
plt.title('Network of Authors and Publishers (Top 25 nodes) between 1940-1960', fontsize=16, weight='bold')
plt.legend()
plt.tight_layout()
plt.axis('off')
plt.show()
In [196]:
# prepare data for network graph
df_exploded = df.explode('Authors')
df_network = df_exploded[
df_exploded['Year'].between(1960, 1980)
][['Year', 'Publisher', 'Authors']].dropna(subset=['Authors']).copy()
# build graph bipartite Publisher ↔ Author
G = nx.from_pandas_edgelist(
df_network,
source='Publisher',
target='Authors',
create_using=nx.Graph()
)
# graph visualization focusing on top nodes
node_degrees = dict(G.degree())
degrees_sorted = sorted(node_degrees.values(), reverse=True)
# degree of the 26th-ranked node; fall back to the lowest degree in smaller graphs
threshold = degrees_sorted[min(25, len(degrees_sorted) - 1)]
# keep nodes whose degree reaches the threshold
nodes_to_draw = [node for node, degree in node_degrees.items() if degree >= threshold]
subgraph = G.subgraph(nodes_to_draw)
plt.figure(figsize=(15, 10))
pos = nx.spring_layout(subgraph, k=0.8, iterations=20) # layout algorithm
# split nodes by type for coloring
publishers_nodes = [n for n in subgraph.nodes() if n in df_network['Publisher'].unique()]
authors_nodes = [n for n in subgraph.nodes() if n in df_network['Authors'].unique()]
# draw nodes
nx.draw_networkx_nodes(
subgraph, pos,
nodelist=publishers_nodes,
node_color='skyblue',
node_size=800,
label='Publishers'
)
nx.draw_networkx_nodes(
subgraph, pos,
nodelist=authors_nodes,
node_color='lightcoral',
node_size=200,
label='Authors'
)
# draw edges and labels
nx.draw_networkx_edges(subgraph, pos, alpha=0.5)
nx.draw_networkx_labels(subgraph, pos, font_size=12)
plt.title('Network of Authors and Publishers (Top 25 nodes) between 1960-1980', fontsize=16, weight='bold')
plt.legend()
plt.tight_layout()
plt.axis('off')
plt.show()
In [197]:
# prepare data for network graph
df_exploded = df.explode('Authors')
df_network = df_exploded[
df_exploded['Year'].between(1980, 2000)
][['Year', 'Publisher', 'Authors']].dropna(subset=['Authors']).copy()
# build bipartite graph Publisher ↔ Author
G = nx.from_pandas_edgelist(
df_network,
source='Publisher',
target='Authors',
create_using=nx.Graph()
)
# graph visualization focusing on top nodes
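node_degrees = dict(G.degree())
# degree of the 25th best-connected node (lowest degree if the graph has fewer than 25 nodes)
ranked_degrees = sorted(node_degrees.values(), reverse=True)
threshold = ranked_degrees[min(25, len(ranked_degrees)) - 1] if ranked_degrees else 0
# keep nodes with degree >= threshold; ties at the threshold can add a few extra nodes
nodes_to_draw = [node for node, degree in node_degrees.items() if degree >= threshold]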
subgraph = G.subgraph(nodes_to_draw)
plt.figure(figsize=(15, 10))
pos = nx.spring_layout(subgraph, k=0.8, iterations=20) # layout algorithm
# split nodes by type for coloring
publishers_nodes = [n for n in subgraph.nodes() if n in df_network['Publisher'].unique()]
authors_nodes = [n for n in subgraph.nodes() if n in df_network['Authors'].unique()]
# draw nodes
nx.draw_networkx_nodes(
subgraph, pos,
nodelist=publishers_nodes,
node_color='skyblue',
node_size=800,
label='Publishers'
)
nx.draw_networkx_nodes(
subgraph, pos,
nodelist=authors_nodes,
node_color='lightcoral',
node_size=200,
label='Authors'
)
# draw edges and labels
nx.draw_networkx_edges(subgraph, pos, alpha=0.5)
nx.draw_networkx_labels(subgraph, pos, font_size=12)
plt.title('Network of Authors and Publishers (Top 25 nodes) between 1980-2000', fontsize=16, weight='bold')
plt.legend()
plt.tight_layout()
plt.axis('off')
plt.show()
In [198]:
# prepare data for network graph
df_exploded = df.explode('Authors')
df_network = df_exploded[
df_exploded['Year'].between(2000, 2010)
][['Year', 'Publisher', 'Authors']].dropna(subset=['Authors']).copy()
# build bipartite graph Publisher ↔ Author
G = nx.from_pandas_edgelist(
df_network,
source='Publisher',
target='Authors',
create_using=nx.Graph()
)
# graph visualization focusing on top nodes
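node_degrees = dict(G.degree())
# degree of the 25th best-connected node (lowest degree if the graph has fewer than 25 nodes)
ranked_degrees = sorted(node_degrees.values(), reverse=True)
threshold = ranked_degrees[min(25, len(ranked_degrees)) - 1] if ranked_degrees else 0
# keep nodes with degree >= threshold; ties at the threshold can add a few extra nodes
nodes_to_draw = [node for node, degree in node_degrees.items() if degree >= threshold]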
subgraph = G.subgraph(nodes_to_draw)
plt.figure(figsize=(15, 10))
pos = nx.spring_layout(subgraph, k=0.8, iterations=20) # layout algorithm
# split nodes by type for coloring
publishers_nodes = [n for n in subgraph.nodes() if n in df_network['Publisher'].unique()]
authors_nodes = [n for n in subgraph.nodes() if n in df_network['Authors'].unique()]
# draw nodes
nx.draw_networkx_nodes(
subgraph, pos,
nodelist=publishers_nodes,
node_color='skyblue',
node_size=800,
label='Publishers'
)
nx.draw_networkx_nodes(
subgraph, pos,
nodelist=authors_nodes,
node_color='lightcoral',
node_size=200,
label='Authors'
)
# draw edges and labels
nx.draw_networkx_edges(subgraph, pos, alpha=0.5)
nx.draw_networkx_labels(subgraph, pos, font_size=12)
plt.title('Network of Authors and Publishers (Top 25 nodes) between 2000-2010', fontsize=16, weight='bold')
plt.legend()
plt.tight_layout()
plt.axis('off')
plt.show()
In [199]:
# prepare data for network graph
df_exploded = df.explode('Authors')
df_network = df_exploded[
df_exploded['Year'].between(2010, 2020)
][['Year', 'Publisher', 'Authors']].dropna(subset=['Authors']).copy()
# build bipartite graph Publisher ↔ Author
G = nx.from_pandas_edgelist(
df_network,
source='Publisher',
target='Authors',
create_using=nx.Graph()
)
# graph visualization focusing on top nodes
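node_degrees = dict(G.degree())
# degree of the 25th best-connected node (lowest degree if the graph has fewer than 25 nodes)
ranked_degrees = sorted(node_degrees.values(), reverse=True)
threshold = ranked_degrees[min(25, len(ranked_degrees)) - 1] if ranked_degrees else 0
# keep nodes with degree >= threshold; ties at the threshold can add a few extra nodes
nodes_to_draw = [node for node, degree in node_degrees.items() if degree >= threshold]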
subgraph = G.subgraph(nodes_to_draw)
plt.figure(figsize=(15, 10))
pos = nx.spring_layout(subgraph, k=0.8, iterations=20) # layout algorithm
# split nodes by type for coloring
publishers_nodes = [n for n in subgraph.nodes() if n in df_network['Publisher'].unique()]
authors_nodes = [n for n in subgraph.nodes() if n in df_network['Authors'].unique()]
# draw nodes
nx.draw_networkx_nodes(
subgraph, pos,
nodelist=publishers_nodes,
node_color='skyblue',
node_size=800,
label='Publishers'
)
nx.draw_networkx_nodes(
subgraph, pos,
nodelist=authors_nodes,
node_color='lightcoral',
node_size=200,
label='Authors'
)
# draw edges and labels
nx.draw_networkx_edges(subgraph, pos, alpha=0.5)
nx.draw_networkx_labels(subgraph, pos, font_size=12)
plt.title('Network of Authors and Publishers (Top 25 nodes) between 2010-2020', fontsize=16, weight='bold')
plt.legend()
plt.tight_layout()
plt.axis('off')
plt.show()
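The network cells above differ only in the year range passed to `between`. As a minimal sketch, the shared steps (build the Publisher ↔ Author graph, keep the best-connected nodes) could be wrapped in one helper and applied to each period. The helper name `top_degree_subgraph`, its `top_n` parameter, and the toy data are hypothetical and not part of the notebook.

```python
import networkx as nx
import pandas as pd


def top_degree_subgraph(df_edges, source="Publisher", target="Authors", top_n=25):
    """Build the source-target graph and keep its best-connected nodes (hypothetical helper)."""
    G = nx.from_pandas_edgelist(df_edges, source=source, target=target)
    if G.number_of_nodes() == 0:
        return G
    ranked = sorted((d for _, d in G.degree()), reverse=True)
    # degree of the top_n-th node, or the lowest degree in smaller graphs
    threshold = ranked[min(top_n, len(ranked)) - 1]
    keep = [n for n, d in G.degree() if d >= threshold]
    return G.subgraph(keep)


# toy data standing in for df.explode('Authors'); all names are invented
toy = pd.DataFrame({
    "Year": [1945, 1950, 1955, 1965],
    "Publisher": ["Pub A", "Pub A", "Pub B", "Pub A"],
    "Authors": ["Author 1", "Author 2", "Author 1", "Author 3"],
})
periods = [(1940, 1960), (1960, 1980)]
subgraphs = {
    period: top_degree_subgraph(toy[toy["Year"].between(*period)], top_n=2)
    for period in periods
}
```

Each entry of `subgraphs` can then be drawn with the same plotting code, changing only the title. Note that `between` is inclusive at both ends, so a boundary year such as 1960 falls into two consecutive periods.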
Graph with Plotly Express: Animated Bar Chart of Subject Distribution by Publisher over Decades¶
In [200]:
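# cut subject to first 3 words
subject_short = df['School Subject'].str.split().str.slice(0, 3).str.join(' ')
# count books per decade, shortened subject and publisher
# observed=False keeps the current pandas default explicitly and avoids the FutureWarning
books_count = (
    df.groupby(['YearInterval', subject_short, 'Publisher'], observed=False)
    .size()
    .reset_index(name='Books Count')
)
fig = px.bar(
    books_count,
    x="School Subject",
    y="Books Count",
    color="Publisher",
    animation_frame="YearInterval", # decade animation
    title="Subject Distribution by Publisher over Decades",
    category_orders={"YearInterval": sorted(df['YearInterval'].unique())}
)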
# rotate x-axis labels to avoid overlap
fig.update_xaxes(
tickangle=45,
automargin=True
)
# move menu and slider down to avoid overlap with x-axis labels
fig.update_layout(
margin=dict(b=100), # enlarge bottom margin
updatemenus=[{
"type": "buttons",
"showactive": True,
"x": -0.05, # move buttons to the left
"y": -0.35, # move buttons
"xanchor": "left",
"yanchor": "top"
}],
sliders=[{
"x": 0.1, # move slider
"y": -0.55, # move slider down
"xanchor": "left",
"yanchor": "top"
}]
)
fig.show(renderer='notebook')
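pandas raises a `FutureWarning` about `observed` when a `groupby` key is categorical. The sketch below shows what the parameter changes, assuming `YearInterval` is a categorical column built with `pd.cut` (an assumption; the notebook may build it differently). The toy values are invented.

```python
import pandas as pd

# toy catalogue; column names mirror the notebook, values are invented
toy = pd.DataFrame({
    "Year": [1945, 1948, 1972],
    "Publisher": ["Pub A", "Pub B", "Pub A"],
})
# assumption: intervals built with pd.cut, which yields a categorical column
toy["YearInterval"] = pd.cut(toy["Year"], bins=[1940, 1960, 1980])

# observed=False also reports combinations that never occur, with a count of 0
all_combos = toy.groupby(["YearInterval", "Publisher"], observed=False).size()
# observed=True keeps only the combinations present in the data
seen_only = toy.groupby(["YearInterval", "Publisher"], observed=True).size()

print(all_combos)
print(seen_only)
```

For the animated bar chart, zero-count rows add nothing to the bars, so `observed=True` would probably give a smaller table with the same picture; which option suits the notebook depends on whether empty combinations should stay visible in the legend and frames.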
Histplot of books by Year and Publisher¶
In [201]:
# Histplot of books by Year and Publisher (25 most frequent publishers)
plt.figure(figsize=(14, 8))
colors_palette = sns.color_palette("Paired", 12) + sns.color_palette("Dark2", 13)
top_publishers = df_exploded['Publisher'].value_counts().head(25).index
ax = sns.histplot(
    df_exploded[df_exploded['Publisher'].isin(top_publishers)],
    x='Year',
    hue='Publisher',
    multiple='stack',
    palette=colors_palette,
    bins=40,
    legend=True
)
# Customize the legend: title, position outside the axes, no frame, smaller font
leg = ax.get_legend()
if leg:
    sns.move_legend(
        ax, "upper left",
        bbox_to_anchor=(1.05, 1),
        frameon=False,
        title="Publisher",
        fontsize=9
    )
else:
    print("⚠️ Legend not found.")
plt.title('Distribution of Books by Year and Publisher (Top 25 Publishers)')
plt.xlabel('Year of Publication')
plt.ylabel('Book Count')
sns.despine()
plt.tight_layout(rect=[0, 0, 0.85, 1])
plt.show()
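One caveat when reading this histogram: `df_exploded` comes from `df.explode('Authors')`, so it holds one row per book and author pair, and a co-authored book is counted once per author. The sketch below illustrates the effect on toy data; the `Title` column and all values are hypothetical.

```python
import pandas as pd

# toy catalogue: Book 1 has three authors, Book 2 has one
toy = pd.DataFrame({
    "Title": ["Book 1", "Book 2"],
    "Publisher": ["Pub A", "Pub A"],
    "Authors": [["Author 1", "Author 2", "Author 3"], ["Author 4"]],
})
toy_exploded = toy.explode("Authors")

# rows in the exploded frame: one per (book, author) pair
rows_per_publisher = toy_exploded["Publisher"].value_counts()
# rows in the original frame: one per book
books_per_publisher = toy["Publisher"].value_counts()

print(rows_per_publisher["Pub A"], books_per_publisher["Pub A"])
```

If the intent is to count books rather than author credits, one option would be to pass the non-exploded `df` to `sns.histplot` instead.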